Classification

In a classification problem the response is qualitative, taking values in an unordered set $C$. Our main goal is to:

  • build a classifier $\hat f: \mathbb{R}^p \to C$ that assigns a future observation $x \in \mathbb{R}^p$ to a class label. To obtain this function, analogously to minimizing the MSE in regression, we minimize the expected error rate $E[1\{Y \ne \hat f(X)\}]$.

The best classifier is known as the Bayes classifier, denoted $f^{*}$; it is the classifier we aim for. It is defined, for every $x$, by $f^{*}(x) = j$ where $j = \arg\max_{k\in C} P[Y=k \mid X=x]$. It always exists in classification problems.

  • $E[1\{Y\ne f^{*}(X)\} \mid X = x] = 1 - \max_{k\in C} P[Y=k \mid X=x]$ is the smallest expected error rate achievable by any classifier; this is the irreducible error, also known as the Bayes error rate (a small numerical sketch follows).
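As a concrete illustration (a minimal sketch, not from the notes), consider a toy one-dimensional problem where $P[Y=k \mid X=x]$ is known in closed form; the Bayes classifier and its error rate can then be computed directly. The distribution parameters and the helper name `posterior_class1` below are made-up illustration choices.

```python
import numpy as np
from scipy.stats import norm

# Toy 1-D setup (illustrative values):
# Y ~ Bernoulli(0.5), X | Y=0 ~ N(-1, 1), X | Y=1 ~ N(+1, 1).
rng = np.random.default_rng(0)
n = 100_000
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=np.where(y == 1, 1.0, -1.0), scale=1.0)

def posterior_class1(x):
    """P[Y=1 | X=x] via Bayes' rule (equal priors here)."""
    p1 = 0.5 * norm.pdf(x, loc=1.0, scale=1.0)
    p0 = 0.5 * norm.pdf(x, loc=-1.0, scale=1.0)
    return p1 / (p0 + p1)

# Bayes classifier: predict the class with the larger posterior probability.
post = posterior_class1(x)
y_hat = (post >= 0.5).astype(int)

# Empirical error of the Bayes classifier vs. the theoretical Bayes error rate
# E[1 - max_k P(Y=k | X)]; for this setup it equals 1 - Phi(1).
print("empirical error:", np.mean(y_hat != y))
print("Bayes error rate:", 1 - norm.cdf(1.0))
```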

The reason we cannot use the usual regression approach (i.e. OLS) for classification is that fitting $E[Y \mid X] = \beta_0 + \beta X$ treats the class labels as numeric: the fitted values need not lie in $[0,1]$ (and with more than two classes the numeric coding of the labels is arbitrary), so the regression estimates may not be what we want. A short sketch of the problem follows.
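A quick sketch of the issue, assuming a 0/1 label coding and simulated data (all values are illustrative): OLS on the labels can produce fitted "probabilities" outside $[0,1]$, while logistic regression keeps them in range.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Simulated 1-D data with 0/1 labels (illustrative values only).
rng = np.random.default_rng(1)
x = rng.uniform(-4, 4, size=(200, 1))
p = 1 / (1 + np.exp(-2 * x[:, 0]))          # true P[Y=1 | X=x]
y = rng.binomial(1, p)

# OLS treats the labels as numbers; fitted values are not constrained to [0, 1].
ols = LinearRegression().fit(x, y)
ols_fit = ols.predict(x)
print("OLS fitted-value range:", ols_fit.min(), ols_fit.max())

# Logistic regression models the log-odds, so predicted probabilities stay in [0, 1].
lr = LogisticRegression().fit(x, y)
lr_prob = lr.predict_proba(x)[:, 1]
print("LR probability range:", lr_prob.min(), lr_prob.max())
```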

We have many methods for classification, such as logistic regression, discriminant analysis, KNN, SVM, decision trees, random forests, boosting, and neural networks.

For a binary classification problem with $y\in \{-1, +1\}$, we have a linear decision boundary $w^T x + b = 0$ for some weights $w\in \mathbb{R}^p$ and $b\in \mathbb{R}$.

  • A good decision boundary satisfies $w^T x + b > 0 \iff y = +1$ and $w^T x + b < 0 \iff y = -1$.
  • We estimate $w$ and $b$ by minimizing $-\sum_{i\in M} y_i (w^T x_i + b)$, where $M$ is the set of misclassified points (the perceptron criterion; see the sketch after this list).
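A minimal sketch of minimizing this criterion by taking gradient steps on misclassified points, on a separable toy dataset (the step size, data, and function name are illustrative assumptions):

```python
import numpy as np

def perceptron(X, y, lr=1.0, max_epochs=100):
    """Minimize the perceptron criterion -sum_{i in M} y_i (w^T x_i + b)
    by gradient steps on misclassified points, with y in {-1, +1}."""
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified (or on the boundary)
                w += lr * yi * xi        # gradient of -y_i (w^T x_i + b) w.r.t. w is -y_i x_i
                b += lr * yi
                mistakes += 1
        if mistakes == 0:                # all training points classified correctly
            break
    return w, b

# Toy separable data (illustrative values only).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.r_[-np.ones(50), np.ones(50)]
w, b = perceptron(X, y)
print("training error:", np.mean(np.sign(X @ w + b) != y))
```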

When the data are separable (possibly in a higher-dimensional feature space), there can be many pairs $(w, b)$ that separate the classes. We therefore look for the optimal separating hyperplane: the hyperplane that separates the two classes and maximizes the distance to the closest point from either class.

  • $w$ is orthogonal to the decision hyperplane: for any $x_1, x_2$ on the hyperplane, $w^T(x_1 - x_2) = 0$.
  • We define $w^* = \frac{w}{\|w\|_2}$, the unit vector pointing in the same direction as $w$; any $(w, b)$ describing the same hyperplane produces the same $w^*$.
  • For any $x$, let $x_{\text{proj}}$ be the point on the hyperplane closest to $x$. Projecting $x - x_{\text{proj}}$ onto $w/\|w\|_2$ gives the distance from $x$ to the hyperplane: $\left|(x-x_{\text{proj}})^T w\right|/\|w\|_2 = \frac{|x^T w - x_{\text{proj}}^T w|}{\|w\|_2} = \frac{|x^T w + b|}{\|w\|_2}$, since $x_{\text{proj}}^T w = -b$.
  • We then impose the margin constraints $y_i(x_i^T w + b) \ge M$ for all $i$, i.e. $w^T x_i + b \ge M$ if $y_i = +1$ and $w^T x_i + b \le -M$ if $y_i = -1$, with $M > 0$.
  • We want this margin to be as large as possible so that we get a good classifier (no training point falls inside the margin). The closest points satisfy $|x^T w + b| = M$, so the margin width is $\frac{|x^T w + b|}{\|w\|_2} = \frac{M}{\|w\|_2}$.
    • Maximizing $\frac{M}{\|w\|_2}$ is the same as minimizing $\frac{\|w\|_2}{M}$; since $(w, b)$ can be rescaled, we may set $M = 1$ W.L.O.G. and minimize $\|w\|_2^2$ (a sketch of the resulting problem follows this list).
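A minimal sketch of the resulting optimization with $M = 1$ (minimize $\|w\|_2^2$ subject to $y_i(x_i^T w + b) \ge 1$), approximated here with scikit-learn's linear `SVC` using a very large `C` as a stand-in for the hard-margin problem; the data and the value of `C` are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Separable toy data (illustrative values only).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-3, 1, size=(50, 2)), rng.normal(3, 1, size=(50, 2))])
y = np.r_[-np.ones(50), np.ones(50)]

# A very large C approximates the hard-margin problem:
#   min ||w||_2^2  subject to  y_i (w^T x_i + b) >= 1.
svm = SVC(kernel="linear", C=1e6).fit(X, y)
w = svm.coef_.ravel()
b = svm.intercept_[0]

# With M = 1, the margin width (hyperplane to closest point) is 1 / ||w||_2.
margin = 1 / np.linalg.norm(w)
print("w:", w, "b:", b, "margin:", margin)

# Support vectors sit (approximately) on the margin boundary: y_i (w^T x_i + b) ≈ 1,
# so their distance to the hyperplane is ≈ the margin width.
sv = svm.support_vectors_
print("y_i (w^T x_i + b) at support vectors:",
      (y[svm.support_] * (sv @ w + b)).round(3))
print("distances |w^T x + b| / ||w||_2:",
      (np.abs(sv @ w + b) / np.linalg.norm(w)).round(3))
```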

Comparison of Classification Methods

  • SVM is more similar to LR than to LDA.
  • SVM does not estimate the probabilities $P[Y=k \mid X=x]$, but LDA and LR do.
  • When the classes are (nearly) separable, SVM and LDA perform better than LR.
  • When the classes are non-separable, LR (with a ridge penalty) and SVM are very similar.
    • The log-odds in LR is linear in the features, so we can further apply a ridge penalty.
  • When Gaussianity can be justified (i.e. the normality assumption holds), LDA has the best performance.
  • SVM and LR are less commonly used for multi-class classification (a rough empirical sketch follows).
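A rough sketch of how one might compare these methods empirically on simulated Gaussian data, where the LDA model is correctly specified; every setting here is an illustrative assumption, not a result from the notes.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Gaussian class-conditional data (illustrative values only).
rng = np.random.default_rng(4)
n = 1000
y = rng.integers(0, 2, size=n)
X = rng.normal(loc=np.where(y[:, None] == 1, 1.0, -1.0), scale=1.5, size=(n, 2))
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

models = {
    "LR (ridge penalty)": LogisticRegression(penalty="l2", C=1.0),
    "LDA": LinearDiscriminantAnalysis(),
    "SVM (linear)": SVC(kernel="linear", C=1.0),
}
for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: test accuracy = {acc:.3f}")
```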